The ACL RD-TEC: A Dataset for Benchmarking Terminology Extraction and Classification in Computational Linguistics
نویسندگان
چکیده
This paper introduces ACL RD-TEC: a dataset for evaluating the extraction and classification of terms from literature in the domain of computational linguistics. The dataset is derived from the Association for Computational Linguistics anthology reference corpus (ACL ARC). In its first release, the ACL RD-TEC consists of automatically segmented, part-of-speech-tagged ACL ARC documents, three lists of candidate terms, and more than 82,000 manually annotated terms. The annotated terms are marked as either valid or invalid, and valid terms are further classified as technology and non-technology terms. Technology terms signify methods, algorithms, and solutions in computational linguistics. The paper describes the dataset and reports the relevant statistics. We hope the step described in this paper encourages a collaborative effort towards building a full-fledged annotated corpus from the computational linguistics literature.
منابع مشابه
The ACL RD-TEC 2.0: A Language Resource for Evaluating Term Extraction and Entity Recognition Methods
This paper introduces the ACL Reference Dataset for Terminology Extraction and Classification, version 2.0 (ACL RD-TEC 2.0). The ACL RD-TEC 2.0 has been developed with the aim of providing a benchmark for the evaluation of term and entity recognition tasks based on specialised text from the computational linguistics domain. This release of the corpus consists of 300 abstracts from articles in t...
متن کاملEvaluating the Reliability and Interaction of Recursively Used Feature Classes for Terminology Extraction
Feature design and selection is a crucial aspect when treating terminology extraction as a machine learning classification problem. We designed feature classes which characterize different properties of terms, and propose a new feature class for components of term candidates. By using random forests, we infer optimal features which are later used to build decision tree classifiers. We evaluate ...
متن کاملHot Topics and Schisms in NLP: Community and Trend Analysis with Saffron on ACL and LREC Proceedings
In this paper we present a comparative analysis of two series of conferences in the field of Computational Linguistics, the LREC conference and the ACL conference. Conference proceedings were analysed using Saffron by performing term extraction and topical hierarchy construction with the goal of analysing topic trends and research communities. The system aims to provide insight into a research ...
متن کاملSemantic Annotation of the ACL Anthology Corpus for the Automatic Analysis of Scientific Literature
This paper describes the process of creating a corpus annotated for concepts and semantic relations in the scientific domain. A part of the ACL Anthology Corpus was selected for annotation, but the annotation process itself is not specific to the computational linguistics domain and could be applied to any scientific corpus. Concepts were identified and annotated fully automatically, based on a...
متن کامل